Exploratory data analysis (EDA)

In [1]:
import wandb
import pandas as pd

run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)
wandb: Currently logged in as: tania-m. Use `wandb login --relogin` to force relogin
wandb version 0.16.2 is available! To upgrade, please run: $ pip install wandb --upgrade
Tracking run with wandb version 0.16.0
Run data is saved locally in /mnt/c/Users/Tania/Desktop/mlops-project2/build-ml-pipeline-for-short-term-rental-prices/src/eda/wandb/run-20240113_123350-pvlumhf6
Syncing run dark-dragon-11 to Weights & Biases
View project at https://wandb.ai/tania-m/nyc_airbnb
View run at https://wandb.ai/tania-m/nyc_airbnb/runs/pvlumhf6

General data profiling

In [2]:
df.shape
Out[2]:
(20000, 16)
In [3]:
# pandas_profiling was renamed to ydata_profiling
# import pandas_profiling
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")
profile.to_notebook_iframe()

Data fixes

In [4]:
# Drop outliers for prices
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
In [5]:
df.shape
Out[5]:
(19001, 16)
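The shape drops from (20000, 16) to (19001, 16), so the price filter removes 999 rows. As a minimal sketch of how that filter behaves, the snippet below applies the same `between` mask to a small made-up frame. The helper name `filter_price_range` and the toy prices are assumptions for illustration and are not part of the notebook or of sample.csv.

```python
import pandas as pd


def filter_price_range(frame: pd.DataFrame, min_price: float, max_price: float) -> pd.DataFrame:
    """Keep rows whose price lies in [min_price, max_price], both ends included."""
    # Series.between is inclusive on both bounds by default
    mask = frame["price"].between(min_price, max_price)
    # .copy() avoids SettingWithCopyWarning on later column assignments
    return frame[mask].copy()


# Hypothetical toy data, not taken from sample.csv
toy = pd.DataFrame({"price": [5, 10, 120, 350, 351, 9999]})
kept = filter_price_range(toy, min_price=10, max_price=350)
dropped = len(toy) - len(kept)

print(kept["price"].tolist())
print(f"dropped {dropped} of {len(toy)} rows")
```

Because both bounds are inclusive, listings priced at exactly 10 or 350 are kept. Reporting the number of dropped rows next to the filter makes it easier to notice if a change to the bounds removes more data than intended.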
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  number_of_reviews               19001 non-null  int64         
 12  last_review                     15243 non-null  datetime64[ns]
 13  reviews_per_month               15243 non-null  float64       
 14  calculated_host_listings_count  19001 non-null  int64         
 15  availability_365                19001 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(7), object(5)
memory usage: 2.5+ MB
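The `df.info()` output shows 15243 non-null values out of 19001 for both `last_review` and `reviews_per_month`, that is, 3758 missing values in each. One way to summarise such gaps is sketched below on a made-up frame. The helper name `missing_summary` and the toy values are assumptions for illustration. The final check, that a missing `last_review` coincides with zero reviews, is a hypothesis worth testing on the real data and is not something the notebook itself establishes.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data, not taken from sample.csv
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12],
    "last_review": pd.to_datetime([None, "2019-05-21", None, "2019-07-05"]),
    "reviews_per_month": [None, 0.21, None, 1.10],
})

summary = missing_summary(toy)
print(summary)

# Assumption to verify on real data: last_review is missing exactly
# when the listing has no reviews at all
no_reviews = toy["number_of_reviews"].eq(0)
consistent = bool((toy["last_review"].isna() == no_reviews).all())
print("missing last_review matches zero reviews:", consistent)
```

If the check holds on the real data, the gaps would be structural rather than data-entry errors, which would affect how they should be imputed downstream.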
In [7]:
profile_for_cleaned_data = ProfileReport(df, title="Profiling Report after cleaning")
profile_for_cleaned_data.to_notebook_iframe()
In [8]:
run.finish()
View run dark-dragon-11 at: https://wandb.ai/tania-m/nyc_airbnb/runs/pvlumhf6
View job at https://wandb.ai/tania-m/nyc_airbnb/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEyOTgwOTk4NQ==/version_details/v5
Synced 7 W&B file(s), 0 media file(s), 3 artifact file(s) and 2 other file(s)
Find logs at: ./wandb/run-20240113_123350-pvlumhf6/logs